-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: vep annotation (dockerised + google batch + airflow) #608
Conversation
…ts/gentropy into dsdo_airflow_batch_vep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much! The step works as expected (job). Just look at my comments to put these functions inside the utils script
@@ -44,7 +44,11 @@ create-dev-cluster: build ## Spin up a simple dataproc cluster with all dependen | |||
--master-machine-type n1-standard-16 \ | |||
--initialization-actions=gs://genetics_etl_python_playground/initialisation/${VERSION_NO}/install_dependencies_on_cluster.sh \ | |||
--metadata="PACKAGE=gs://genetics_etl_python_playground/initialisation/${VERSION_NO}/gentropy-${VERSION_NO}-py3-none-any.whl,CONFIGTAR=gs://genetics_etl_python_playground/initialisation/${VERSION_NO}/config.tar.gz" \ | |||
--single-node \ | |||
--num-workers 4 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't set these specs by default, I suggest reverting the changes
import time | ||
from dataclasses import dataclass | ||
from pathlib import Path | ||
from typing import Any, List |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python 3.10 supports list
as type hints
from typing import Any, List | |
from typing import Any |
return runnable | ||
|
||
|
||
def create_task_spec(image: str, commands: List[str]) -> batch_v1.TaskSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is useful inside common_airflow.py
. I've used it in my branch to submit Batch jobs. I suggest moving it
return runnable | ||
|
||
|
||
def create_task_spec(image: str, commands: List[str]) -> batch_v1.TaskSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def create_task_spec(image: str, commands: List[str]) -> batch_v1.TaskSpec: | |
def create_task_spec(image: str, commands: list[str]) -> batch_v1.TaskSpec: |
return task | ||
|
||
|
||
def set_up_mouting_points( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is useful inside common_airflow.py
. I've used it in my branch to submit Batch jobs. I suggest moving it
|
||
Args: | ||
image (str): The Docker image to use. | ||
commands (List[str]): The commands to run in the container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commands (List[str]): The commands to run in the container. | |
commands (list[str]): The commands to run in the container. |
return volumes | ||
|
||
|
||
def create_batch_job( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is useful inside common_airflow.py
. I've used it in my branch to submit Batch jobs. I suggest moving it
return list(self.path_dictionary.values()) | ||
|
||
|
||
def create_container_runnable(image: str, commands: List[str]) -> batch_v1.Runnable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is useful inside common_airflow.py
. I've used it in my branch to submit Batch jobs. I suggest moving it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored this function to be able to pass params to the container (to run the gentropy step with the HYDRA_FULL_ERROR=1
env variable.
I suggest the following refactoring:
def create_container_runnable(
image: str, commands: list[str], **kwargs: Any
) -> batch_v1.Runnable:
"""Create a container runnable for a Batch job with additional optional parameters.
Args:
image (str): The Docker image to use.
commands (list[str]): The commands to run in the container.
kwargs (Any): Additional optional parameters to set on the container.
Returns:
batch_v1.Runnable: The container runnable.
"""
container = batch_v1.Runnable.Container(
image_uri=image, entrypoint="/bin/sh", commands=commands, **kwargs
)
return batch_v1.Runnable(container=container)
gcs_volume.gcs = gcs_bucket | ||
gcs_volume.mount_path = mount["mount_point"] | ||
volumes.append(gcs_volume) | ||
return volumes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I've understand the necessity of mounting the volumes in the case of VEP. Could you briefly explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VEP is a command line application, which cannot read or write files from google cloud location. So the folders containing cache, input and output files, need to be mounted.
* feat: custom dockerfile to run ensembl vep * ci: automate vep image build and artifact registry * chore: update airflow google operators (not required) * feat: working version of google batch airflow vep job * feat: working version of google batch airflow vep job * feat(VEP): adding CADD plugin * feat: local loftee file * feat: working with input bucket full of input files * feat: prevent writing html * fix: minor adjustments to retry strategy * feat(airflow): separating mounting points for input/output and cache * fix: typo in airflow dag * fix: pre-commit pain * chore: rename airflow dag file --------- Co-authored-by: DSuveges <[email protected]> Co-authored-by: Szymon Szyszkowski <[email protected]>
✨ Context
We need a mechanism to retrieve Variant Annotation from variants outside GnomAD. After some consideration, we concluded Ensembl VEP is the best stable source to retrieve the required annotations.
🛠 What does this PR implement
A mechanism to run Ensembl VEP for a given set of variants. WIP (more details soon)
🙈 Missing
WIP
🚦 Before submitting
dev
branch?make test
)?poetry run pre-commit run --all-files
)?